Summary: It's been a little over three years since I attempted my first ML project: classifying survival on the Titanic. A pretty famous dataset and task for this field, I'd say. I achieved an accuracy of around 75% on my first attempt (I believe with a random forest model). I thought it would be nice to revisit this and give it another go now that I'm a little more experienced. Additionally, I've wanted to explore ensemble learning some more, particularly XGBoost. Let's dive in!

In [1]:
#By Andrew Trick 

Titanic Revisited (w/ XGBoost)

Goal

I originally worked with this dataset about 3.5 years ago while working through Udacity's Nanodegree in Data Analytics. I had, more or less, no idea what I was doing then. I thought it would be nice to revisit the dataset and see if I could get a better accuracy than my first time through (which was around 74%, if I recall). In particular, I've been wanting to work with XGBoost for a while now, and this seemed like an appropriate classification problem to give it a go on!

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import math
import xgboost

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Input data files are available in the "input/" directory

train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
sub = pd.read_csv('input/gender_submission.csv')

train.head()
Out[1]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
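
Before diving into feature work, it's worth a quick look at which columns are actually missing values; a small check I'd add here (not part of the original run):

#Quick null audit: Age, Cabin, and Embarked (plus Fare in the test set) are the usual gaps
print(train.isnull().sum())
print(test.isnull().sum())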

Feature Generation and Removal

Time to generate a few features which may be of use in classification.

In [2]:
#Does the passenger have a cabin?
train['cabin_binary'] = train["Cabin"].apply(lambda i: 0 if str(i) == "nan" else 1)

#Family Size
train['family_size'] = 1 + train['SibSp'] + train['Parch']
train['solo'] = train["family_size"].apply(lambda i: 1 if i == 1 else 0)

#Fix Nulls
train['Embarked'] = train['Embarked'].fillna('S')
train['Age'] = train['Age'].fillna(int(np.mean(train['Age'])))
train['Fare'] = train['Fare'].fillna(np.mean(train['Fare']))

#A few age specific Binaries
train['Child'] = train["Age"].apply(lambda i: 1 if i <= 17 and i > 6 else 0)
train['toddler'] = train["Age"].apply(lambda i: 1 if i <= 6 else 0)
train['Elderly'] = train["Age"].apply(lambda i: 1 if i >= 60 else 0)

# Fancy fancy
train['fancy'] = train['Fare'].apply(lambda i: 1 if i >= 100 else 0)

# standard
train['standard_fare'] = train['Fare'].apply(lambda i: 1 if i <= 10.0 else 0)

#No requirement to standardize in DT models, but might as well
fare_scaler = StandardScaler()
fare_scaler.fit(train['Fare'].values.reshape(-1, 1))
train['fare_std'] = fare_scaler.transform(train['Fare'].values.reshape(-1, 1))

#get title/status of passenger
train['title'] = 'default'

for i in train.values:
    name = i[3] #First checks for rare titles (Thanks Anisotropic's wonderful Kernel for inspiration//help here!)
    for e in ['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']:
        if e in name:
            train.loc[train['Name'] == name, 'title'] = 'rare'
    if 'Miss' in name or  'Mlle' in name or 'Ms' in name or 'Mme' in name or 'Mrs' in name:
        train.loc[train['Name'] == name, 'title'] = 'Ms'
    if 'Mr.' in name or 'Master' in name:
        train.loc[train['Name'] == name, 'title'] = 'Mr'


train.head(10)
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare ... cabin_binary family_size solo Child toddler Elderly fancy standard_fare fare_std title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 ... 0 2 0 0 0 0 0 1 -0.502445 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 ... 1 2 0 0 0 0 0 0 0.786845 Ms
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 ... 0 1 1 0 0 0 0 1 -0.488854 Ms
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 ... 1 2 0 0 0 0 0 0 0.420730 Ms
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 ... 0 1 1 0 0 0 0 1 -0.486337 Mr
5 6 0 3 Moran, Mr. James male 29.0 0 0 330877 8.4583 ... 0 1 1 0 0 0 0 1 -0.478116 Mr
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 ... 1 1 1 0 0 0 0 0 0.395814 Mr
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 ... 0 5 0 0 1 0 0 0 -0.224083 Mr
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 ... 0 3 0 0 0 0 0 0 -0.424256 Ms
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 ... 0 2 0 1 0 0 0 0 -0.042956 Ms

10 rows × 22 columns

Let's send the test data through the same pipeline! (The duplication here is ripe for a helper function; I sketch that after the output below.)

In [3]:
#Does the passenger have a cabin?
test['cabin_binary'] = test["Cabin"].apply(lambda i: 0 if str(i) == "nan" else 1)

#Family Size
test['family_size'] = 1 + test['SibSp'] + test['Parch']
test['solo'] = test["family_size"].apply(lambda i: 1 if i == 1 else 0)

#Fix Nulls
test['Embarked'] = test['Embarked'].fillna('S')
test['Age'] = test['Age'].fillna(int(np.mean(test['Age'])))
test['Fare'] = test['Fare'].fillna(np.mean(test['Fare']))

#A few age specific Binaries
test['Child'] = test["Age"].apply(lambda i: 1 if i <= 17 and i > 6 else 0)
test['toddler'] = test["Age"].apply(lambda i: 1 if i <= 6 else 0)
test['Elderly'] = test["Age"].apply(lambda i: 1 if i >= 60 else 0)

# Fancy fancy
test['fancy'] = test['Fare'].apply(lambda i: 1 if i >= 100 else 0)
test['standard_fare'] = test['Fare'].apply(lambda i: 1 if i <= 10.0 else 0)

#standardize
test['fare_std'] = fare_scaler.transform(test['Fare'].values.reshape(-1, 1))

#get title/status of passenger
test['title'] = 'default'

for i in test.values:
    name = i[2] #First checks for rare titles (Thanks Anisotropic's wonderful Kernel for inspiration//help here!)
    for e in ['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']:
        if e in name:
            test.loc[test['Name'] == name, 'title'] = 'rare'
    if 'Miss' in name or  'Mlle' in name or 'Ms' in name or 'Mme' in name or 'Mrs' in name:
        test.loc[test['Name'] == name, 'title'] = 'Ms'
    if 'Mr.' in name or 'Master' in name:
        test.loc[test['Name'] == name, 'title'] = 'Mr'


test.head(10)
Out[3]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin ... cabin_binary family_size solo Child toddler Elderly fancy standard_fare fare_std title
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN ... 0 1 1 0 0 0 0 1 -0.490783 Mr
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN ... 0 2 0 0 0 0 0 1 -0.507479 Ms
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN ... 0 1 1 0 0 1 0 1 -0.453367 Mr
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN ... 0 1 1 0 0 0 0 1 -0.474005 Mr
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN ... 0 3 0 0 0 0 0 0 -0.401017 Ms
5 897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 NaN ... 0 1 1 1 0 0 0 1 -0.462679 Mr
6 898 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN ... 0 1 1 0 0 0 0 1 -0.494810 Ms
7 899 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29.0000 NaN ... 0 3 0 0 0 0 0 0 -0.064516 Mr
8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN ... 0 1 1 0 0 0 0 1 -0.502864 Ms
9 901 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.1500 NaN ... 0 3 0 0 0 0 0 0 -0.162169 Mr

10 rows × 21 columns
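
Since the train and test frames go through identical transformations, the two cells above could be factored into one helper applied to both. A minimal sketch of that refactor (the engineer_features name and the regex-based title mapping are my own; note it collapses any unmatched title into 'rare' rather than the loop's 'default'):

def engineer_features(df, scaler):
    """Apply the same feature engineering used above to either Titanic frame."""
    df = df.copy()

    # Cabin indicator, family size, and traveling-alone flag
    df['cabin_binary'] = df['Cabin'].notna().astype(int)
    df['family_size'] = 1 + df['SibSp'] + df['Parch']
    df['solo'] = (df['family_size'] == 1).astype(int)

    # Fill missing values
    df['Embarked'] = df['Embarked'].fillna('S')
    df['Age'] = df['Age'].fillna(int(df['Age'].mean()))
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())

    # Age- and fare-based binaries
    df['Child'] = ((df['Age'] > 6) & (df['Age'] <= 17)).astype(int)
    df['toddler'] = (df['Age'] <= 6).astype(int)
    df['Elderly'] = (df['Age'] >= 60).astype(int)
    df['fancy'] = (df['Fare'] >= 100).astype(int)
    df['standard_fare'] = (df['Fare'] <= 10.0).astype(int)

    # Standardize fare with the scaler fit on the training data
    df['fare_std'] = scaler.transform(df['Fare'].values.reshape(-1, 1))

    # Vectorized title extraction: grab the word before the '.' in Name,
    # then collapse it into the same Mr / Ms / rare buckets
    raw = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['title'] = raw.map({'Mr': 'Mr', 'Master': 'Mr', 'Mrs': 'Ms', 'Miss': 'Ms',
                           'Ms': 'Ms', 'Mlle': 'Ms', 'Mme': 'Ms'}).fillna('rare')
    return df

#Usage (equivalent to the two cells above):
#train = engineer_features(train, fare_scaler)
#test = engineer_features(test, fare_scaler)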

Remove Unnecessary Features and Encode Categoricals

In [4]:
train = pd.get_dummies(train, columns=["Sex", "Embarked", "title"])
test = pd.get_dummies(test, columns=["Sex", "Embarked", "title"])

train = train.drop(['Name','PassengerId', 'Ticket', 'Cabin', 'Fare', 'SibSp'], axis = 1)
test = test.drop(['Name','PassengerId', 'Ticket', 'Cabin', 'Fare', 'SibSp'], axis = 1)

train.head()
Out[4]:
Survived Pclass Age Parch cabin_binary family_size solo Child toddler Elderly ... standard_fare fare_std Sex_female Sex_male Embarked_C Embarked_Q Embarked_S title_Mr title_Ms title_rare
0 0 3 22.0 0 0 2 0 0 0 0 ... 1 -0.502445 0 1 0 0 1 1 0 0
1 1 1 38.0 0 1 2 0 0 0 0 ... 0 0.786845 1 0 1 0 0 0 1 0
2 1 3 26.0 0 0 1 1 0 0 0 ... 1 -0.488854 1 0 0 0 1 0 1 0
3 1 1 35.0 0 1 2 0 0 0 0 ... 0 0.420730 1 0 0 0 1 0 1 0
4 0 3 35.0 0 0 1 1 0 0 0 ... 1 -0.486337 0 1 0 0 1 1 0 0

5 rows × 21 columns
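
One caveat with calling get_dummies on train and test separately: if a category shows up in only one frame, the dummy columns won't match and prediction on the test set will fail. A small safeguard I'd consider adding (not in the original run) is to reindex test against the training feature columns:

#Align test's columns to train's feature columns, filling missing dummies with 0
feature_cols = train.drop('Survived', axis=1).columns
test = test.reindex(columns=feature_cols, fill_value=0)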

Quick EDA Visuals

Let's do just a little bit of EDA with Seaborn:

In [9]:
#Correlation matrix of numerical vars
plt.figure(figsize=(14,12))
plt.title('Correlation Matrix', size=8)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=plt.cm.RdBu, linecolor='white', annot=True)
plt.show()
In [10]:
#Family size histo
sns.distplot(train['family_size'])
plt.show()
In [11]:
#boxplot of family size and survival
sns.boxplot(x="Survived", y="family_size", data=train)
plt.show()
In [12]:
#Fare to Age relationship?
sns.lmplot(x='fare_std', y='Age', data=train,
           fit_reg=False, scatter_kws={"marker": "D", "s": 20})
plt.show()
In [13]:
#boxplot of age and survival
sns.boxplot(x="Survived", y="Age", data=train)
plt.show()
In [14]:
#scatter of fare and survival
sns.lmplot(x="Survived", y="fare_std", data=train, fit_reg=False)
plt.show()
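
The plots above lean on the raw numeric columns; it's also worth a quick look at survival rates across a few of the engineered groups. A small check I'd add here (not one of the original cells):

#Mean of Survived within each group = that group's survival rate
print(train.groupby('Sex_female')['Survived'].mean())
print(train.groupby('Pclass')['Survived'].mean())
print(train.groupby('family_size')['Survived'].mean())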

Classifying with XGBoost

I'll be comparing XGBoost to AdaBoost, GradientBoosting, RandomForest, and maybe an SVC or something else.

First off, split the training data into train/test sets for validation:

In [15]:
X = train.drop(['Survived'], axis = 1)
y = train['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Alright, let's first try some more traditional models, AKA a random forest and a standard decision tree.

In [12]:
#Random Forest Setup
ranfor = RandomForestClassifier()
parameters = {'n_estimators':[10,50,100], 'random_state': [42, 138], \
              'max_features': ['auto', 'log2', 'sqrt']}
ranfor_clf = GridSearchCV(ranfor, parameters)
ranfor_clf.fit(X_train, y_train)


#Cross validate
cv_results = cross_validate(ranfor_clf, X_train, y_train)
cv_results['test_score']  

y_pred = ranfor_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.810055865922
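
It's also worth peeking at which engineered features the tuned forest actually leans on; a quick look at the best estimator's importances (not part of the original run):

#GridSearchCV refits the best parameter combo on the full training split;
#its feature_importances_ line up with the columns of X_train
best_rf = ranfor_clf.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))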
In [16]:
##Decision Tree Go
dt = DecisionTreeClassifier()
parameters = {'random_state': [42, 138],'max_features': ['auto', 'log2', 'sqrt']}
dt_clf = GridSearchCV(dt, parameters)
dt_clf.fit(X_train, y_train)

y_pred = dt_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.776536312849

At 81% with a random forest... I'm stoked, as this is already better than my last attempt. Let's keep pushing on and give some boosting models a go.

In [47]:
ada = AdaBoostClassifier(base_estimator = DecisionTreeClassifier())
parameters = {'n_estimators':[10,50,100], 'random_state': [42, 138], 'learning_rate': [0.1, 0.5, 0.8, 1.0]}
ada_clf = GridSearchCV(ada, parameters)
ada_clf.fit(X_train, y_train)

cv_results = cross_validate(ada_clf, X_train, y_train)
cv_results['test_score']  

y_pred = ada_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.8156424581005587
In [17]:
gradBoost = GradientBoostingClassifier()
parameters = {'n_estimators':[10,50,100], 'random_state': [42, 138], 'learning_rate': [0.1, 0.5, 0.8, 1.0], \
             'loss' : ['deviance', 'exponential']}
gb_clf = GridSearchCV(gradBoost, parameters)
gb_clf.fit(X_train, y_train)

cv_results = cross_validate(gb_clf, X_train, y_train)
cv_results['test_score']  

y_pred = gb_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.821229050279

So GradientBoosting has given me the best score so far at 82%... let's finally make our way to XGBoost:

In [13]:
xg = xgboost.XGBClassifier(max_depth = 3, n_estimators = 400, learning_rate = 0.1)
xg.fit(X_train, y_train)

cv_results = cross_validate(xg, X_train, y_train)
cv_results['test_score']  

y_pred = xg.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.832402234637
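
Unlike the other models, XGBoost above gets a single hand-picked parameter set rather than a grid search. For a fairer comparison it could go through the same GridSearchCV treatment; a sketch with an illustrative grid (these values are my own, not tuned results):

#Same pattern as the models above: sweep a small grid and keep the best combo
xg_grid = {'max_depth': [3, 4, 6],
           'n_estimators': [100, 400],
           'learning_rate': [0.05, 0.1, 0.3]}
xg_clf = GridSearchCV(xgboost.XGBClassifier(), xg_grid)
xg_clf.fit(X_train, y_train)

print(xg_clf.best_params_)
print(accuracy_score(y_test, xg_clf.predict(X_test)))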
In [85]:
#Confusion matrix
y_pred = xg.predict(X_test)
# Rows = actual, columns = predicted: [[TN, FP], [FN, TP]]
confusion_matrix(y_test, y_pred)
Out[85]:
array([[91, 14],
       [17, 57]])
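
Reading the matrix: rows are the actual classes and columns the predictions, so the diagonal holds the correct calls. The headline metrics can be derived straight from it (a quick sketch, not an original cell):

#Unpack the 2x2 matrix and compute accuracy, precision, and recall by hand
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('accuracy :', (tp + tn) / (tp + tn + fp + fn))
print('precision:', tp / (tp + fp))  # of predicted survivors, how many really survived
print('recall   :', tp / (tp + fn))  # of actual survivors, how many the model caught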

So... a liiittle better than the gradient boost. I'm happy with that for a quick project like this. Let's write it out and submit.

In [13]:
xg = xgboost.XGBClassifier(max_depth = 3, n_estimators = 400, learning_rate = 0.1)
xg.fit(X, y)

cv_results = cross_validate(xg, X, y)
predictions = xg.predict(test)

sub['Survived'] = predictions
sub.to_csv("first_submission_xgb.csv", index=False)